Abstract

Algorithmic entropy can be seen as a special case of entropy as studied in
statistical mechanics. This viewpoint allows us to apply many techniques
developed for use in thermodynamics to the subject of algorithmic information
theory. In particular, suppose we fix a universal prefix-free Turing
machine and let X be the set of programs that halt for this machine.
Then we can regard X as a set of 'microstates', and treat any function
on X as an 'observable'. For any collection of observables, we can study
the Gibbs ensemble that maximizes entropy subject to constraints on expected
values of these observables. We illustrate this by taking the log
runtime, length, and output of a program as observables analogous to the
energy E, volume V and number of molecules N in a container of gas.
The conjugate variables of these observables allow us to define quantities
which we call the 'algorithmic temperature' T, 'algorithmic pressure' P
and 'algorithmic potential' μ, since they are analogous to the temperature,
pressure and chemical potential. We derive an analogue of the
fundamental thermodynamic relation dE = TdS − PdV + μdN, and use
it to study thermodynamic cycles analogous to those for heat engines. We
also investigate the values of T, P and μ for which the partition function
converges. At some points on the boundary of this domain of convergence,
the partition function becomes uncomputable. Indeed, at these points the
partition function itself has nontrivial algorithmic entropy.

1 Introduction



Many authors (see, for example, [28]) have discussed the analogy between
algorithmic entropy and entropy as defined in statistical mechanics: that is, 
the entropy of a probability measure p on a set X. It is perhaps insufficiently 
appreciated that algorithmic entropy can be seen as a special case of the entropy 
as defined in statistical mechanics. We describe how to do this in Section 3.

This allows all the basic techniques of thermodynamics to be imported to 
algorithmic information theory. The key idea is to take X to be some version of 
'the set of all programs that eventually halt and output a natural number', and 
let p be a Gibbs ensemble on X. A Gibbs ensemble is a probability measure that
maximizes entropy subject to constraints on the mean values of some observables,
that is, real-valued functions on X.

In most traditional work on algorithmic entropy, the relevant observable 
is the length of the program. However, much of the interesting structure of 
thermodynamics only becomes visible when we consider several observables. 
When X is the set of programs that halt and output a natural number, some 
other important observables include the output of the program and logarithm 
of its runtime. So, in Section 4 we illustrate how ideas from thermodynamics
can be applied to algorithmic information theory using these three observables. 

To do this, we consider a Gibbs ensemble of programs which maximizes 
entropy subject to constraints on: 

• E, the expected value of the logarithm of the program's runtime (which 
we treat as analogous to the energy of a container of gas), 

• V, the expected value of the length of the program (analogous to the 
volume of the container), and 

• N, the expected value of the program's output (analogous to the number
of molecules in the gas).

This measure is of the form

$$p(x) = \frac{1}{Z}\, e^{-\beta E(x) - \gamma V(x) - \delta N(x)}$$

for certain numbers β, γ, δ, where the normalizing factor

$$Z = \sum_{x \in X} e^{-\beta E(x) - \gamma V(x) - \delta N(x)}$$

is called the 'partition function' of the ensemble. The partition function reduces
to Chaitin's number Ω when β = 0, γ = ln 2 and δ = 0. This number is
uncomputable [6]. However, we show that the partition function Z is computable
when β > 0, γ ≥ ln 2, and δ ≥ 0.

We derive an algorithmic analogue of the basic thermodynamic relation 

$$dE = T\,dS - P\,dV + \mu\,dN.$$

Here:

• S is the entropy of the Gibbs ensemble,

• T = 1/β is the 'algorithmic temperature' (analogous to the temperature
of a container of gas). Roughly speaking, this counts how many times you
must double the runtime in order to double the number of programs in
the ensemble while holding their mean length and output fixed.

• P = γ/β is the 'algorithmic pressure' (analogous to pressure). This
measures the tradeoff between runtime and length. Roughly speaking,
it counts how much you need to decrease the mean length to increase the
mean log runtime by a specified amount, while holding the number of
programs in the ensemble and their mean output fixed.

• μ = −δ/β is the 'algorithmic potential' (analogous to chemical potential).
Roughly speaking, this counts how much the mean log runtime increases
when you increase the mean output while holding the number of programs
in the ensemble and their mean length fixed.

Starting from this relation, we derive analogues of Maxwell's relations and 
consider thermodynamic cycles such as the Carnot cycle or Stoddard cycle. For 
this we must introduce concepts of 'algorithmic heat' and 'algorithmic work'. 




Charles Babbage described a computer powered by a steam engine; we describe
a heat engine powered by programs! We admit that the significance of
this line of thinking remains a bit mysterious. However, we hope it points the
way toward a further synthesis of algorithmic information theory and
thermodynamics. We call this hoped-for synthesis 'algorithmic thermodynamics'.



2 Related Work 



Li and Vitányi use the term 'algorithmic thermodynamics' for describing physical
states using a universal prefix-free Turing machine U. They look at the
smallest program p that outputs a description x of a particular microstate to
some accuracy, and define the physical entropy to be

$$S_A(x) = (k \ln 2)\,(K(x) + H_x),$$

where K(x) = |p| and H_x embodies the uncertainty in the actual state given
x. They summarize their own work and subsequent work by others in chapter
eight of their book [17]. Whereas they consider x = U(p) to be a microstate,
we consider p to be the microstate and x the value of the observable U. Then
their observables O(x) become observables of the form O(U(p)) in our model.

Tadaki [27] generalized Chaitin's number Ω to a function Ω^D and showed
that the value of this function is compressible by a factor of exactly D when
D is computable. Calude and Stay [5] pointed out that this generalization was
formally equivalent to the partition function of a statistical mechanical system
where temperature played the role of the compressibility factor, and studied
various observables of such a system. Tadaki [28] then explicitly constructed
a system with that partition function: given a total length E and number of
programs N, the entropy of the system is the log of the number of E-bit strings
in dom(U)^N. The temperature is given by

$$\frac{1}{T} = \left.\frac{\Delta S}{\Delta E}\right|_{N}.$$

In a follow-up paper [29], Tadaki showed that various other quantities like the
free energy shared the same compressibility properties as Ω^D. In this paper,
we consider multiple variables, which is necessary for thermodynamic cycles,
chemical reactions, and so forth.

Manin and Marcolli [20] derived similar results in a broader context and
studied phase transitions in those systems. Manin [18, 19] also outlined an
ambitious program to treat the infinite runtimes one finds in undecidable
problems as singularities to be removed through the process of renormalization. In
a manner reminiscent of hunting for the proper definition of the "one-element
field" F_un, he collected ideas from many different places and considered how
they all touch on this central theme. While he mentioned a runtime cutoff as
being analogous to an energy cutoff, the renormalizations he presented are
uncomputable. In this paper, we take the log of the runtime as being analogous
to the energy; the randomness described by Chaitin and Tadaki then arises as
the infinite-temperature limit.

3 Algorithmic Entropy 

To see algorithmic entropy as a special case of the entropy of a probability 
measure, it is useful to follow Solomonoff [24] and take a Bayesian viewpoint. 
In Bayesian probability theory, we always start with a probability measure called 
a 'prior', which describes our assumptions about the situation at hand before 
we make any further observations. As we learn more, we may update this
prior. This approach suggests that we should define the entropy of a probability
measure relative to another probability measure: the prior.

A probability measure p on a finite set X is simply a function p: X → [0, 1]
whose values sum to 1, and its entropy is defined as follows:

$$S(p) = -\sum_{x \in X} p(x) \ln p(x).$$

But we can also define the entropy of p relative to another probability measure q:

$$S(p, q) = \sum_{x \in X} p(x) \ln \frac{p(x)}{q(x)}. \qquad (1)$$

This relative entropy has been extensively studied and goes by various other 
names, including 'Kullback-Leibler divergence' [13] and 'information gain' [25] . 

The term 'information gain' is nicely descriptive. Suppose we initially assume
the outcome of an experiment is distributed according to the probability
measure q. Suppose we then repeatedly do the experiment and discover its outcome
is distributed according to the measure p. Then the information gained is
S(p,q).

Why? We can see this in terms of coding. Suppose X is a finite set of signals 
which are randomly emitted by some source. Suppose we wish to encode these 
signals as efficiently as possible in the form of bit strings. Suppose the source 
emits the signal x with probability p(x), but we erroneously believe it is emitted 
with probability q(x). Then S(p,q)/ln 2 is the expected extra message-length
per signal that is required if we use a code that is optimal for the measure q 
instead of a code that is optimal for the true measure, p. 
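
The coding interpretation can be checked on a small example. The sketch below assumes the convention S(p, q) = Σ p(x) ln(p(x)/q(x)) and idealized code lengths of −log₂ q(x) bits (ignoring rounding to whole bits); the four-signal source and its probabilities are hypothetical, chosen only for illustration.

```python
import math

def relative_entropy(p, q):
    """S(p, q) = sum_x p(x) ln(p(x)/q(x)) in nats; terms with p(x) = 0 contribute 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def expected_code_length(p, q):
    """Expected bits per signal when signals follow p but code lengths are -log2 q(x)."""
    return sum(px * -math.log2(q[x]) for x, px in p.items() if px > 0)

# Hypothetical source: the numbers are illustrative only.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # true distribution
q = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}    # mistaken belief

# Extra bits paid for using the code that is optimal for q instead of p.
extra_bits = expected_code_length(p, q) - expected_code_length(p, p)
print(extra_bits)                              # 0.25
print(relative_entropy(p, q) / math.log(2))    # the same quantity, via S(p, q)/ln 2
```

Here the code matched to q spends 2 bits on every signal, while the code matched to p spends 1.75 bits on average, so the penalty is a quarter of a bit per signal.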

The ordinary entropy S(p) is, up to a sign and an additive constant, just the
relative entropy in the special case where the prior assigns an equal probability
to each outcome. In other words:

$$S(p) = -S(p, q_0) + S(q_0)$$

when q_0 is the so-called 'uninformative prior', with q_0(x) = 1/|X| for all x ∈ X.

We can also define relative entropy when the set X is countably infinite. As
before, a probability measure on X is a function p: X → [0, 1] whose values sum
to 1. And as before, if p and q are two probability measures on X, the entropy
of p relative to q is defined by

$$S(p, q) = \sum_{x \in X} p(x) \ln \frac{p(x)}{q(x)}.$$

But now the role of the prior becomes more clear, because there is no probability
measure that assigns the same value to each outcome!

In what follows we will take X to be, roughly speaking, the set of
all programs that eventually halt and output a natural number. As we shall
see, while this set is countably infinite, there are still some natural probability
measures on it, which we may take as priors.

To make this precise, we recall the concept of a universal prefix-free Turing
machine. In what follows we use string to mean a bit string, that is, a finite,
possibly empty, list of 0's and 1's. If x and y are strings, let x||y be the
concatenation of x and y. A prefix of a string z is a substring beginning with the
first letter, that is, a string x such that z = x||y for some y. A prefix-free
set of strings is one in which no element is a prefix of any other. The domain
dom(M) of a Turing machine M is the set of strings that cause M to eventually
halt. We call the strings in dom(M) programs. We assume that when M
halts on the program x, it outputs a natural number M(x). Thus we may think
of the machine M as giving a function M : dom(M) → ℕ.
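
As a small illustration of the prefix-free condition, the following sketch tests whether a finite collection of bit strings is prefix-free. The function name `is_prefix_free` and the sample sets are hypothetical; a real dom(M) is in general infinite and not decidable, so this only applies to finite samples.

```python
def is_prefix_free(strings):
    """Return True if no string in the collection is a prefix of a different one."""
    ordered = sorted(set(strings))
    # In lexicographic order, if a is a prefix of some other string, then the
    # string immediately after a also starts with a, so adjacent pairs suffice.
    return all(not b.startswith(a) for a, b in zip(ordered, ordered[1:]))

# Hypothetical finite sets of "programs".
print(is_prefix_free(["0", "10", "110", "111"]))   # True
print(is_prefix_free(["0", "01", "11"]))           # False: "0" is a prefix of "01"
```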

A prefix-free Turing machine is one whose halting programs form a
prefix-free set. A prefix-free machine U is universal if for any prefix-free Turing
machine M there exists a constant c such that for each string x, there exists a
string y with

$$U(y) = M(x) \quad \text{and} \quad |y| < |x| + c.$$

Let U be a universal prefix-free Turing machine. Then we can define some
probability measures on X = dom(U) as follows. Let

$$|\cdot| \colon X \to \mathbb{N}$$

be the function assigning to each bit string its length. Then there is for any
constant γ ≥ ln 2 a probability measure p given by

$$p(x) = \frac{1}{Z}\, e^{-\gamma |x|}.$$

Here the normalization constant Z is chosen to make the numbers p(x) sum to 1:

$$Z = \sum_{x \in X} e^{-\gamma |x|}.$$

It is worth noting that for computable real numbers γ ≥ ln 2, the normalization
constant Z is uncomputable [27]. Indeed, when γ = ln 2, Z is Chaitin's famous
number Ω. We return to this issue in Section 4.5.

Let us assume that each program prints out some natural number as its
output. Thus we have a function

$$N \colon X \to \mathbb{N}$$

where N(x) equals i when program x prints out the number i. We may use this
function to 'push forward' p to a probability measure q on the set ℕ. Explicitly:

$$q(i) = \frac{1}{Z} \sum_{x \in X :\, N(x) = i} e^{-\gamma |x|}.$$

In other words, if i is some natural number, q(i) is the probability that a program 
randomly chosen according to the measure p will print out this number.

Given any natural number n, there is a probability measure δ_n on ℕ that
assigns probability 1 to this number:

$$\delta_n(m) = \begin{cases} 1 & \text{if } m = n \\ 0 & \text{otherwise.} \end{cases}$$

We can compute the entropy of δ_n relative to q:

$$S(\delta_n, q) = \sum_{i \in \mathbb{N}} \delta_n(i) \ln \frac{\delta_n(i)}{q(i)} = -\ln q(n) = -\ln\Big( \sum_{x \in X :\, N(x) = n} e^{-\gamma |x|} \Big) + \ln Z. \qquad (2)$$

Since the quantity ln Z is independent of the number n, and uncomputable, it
makes sense to focus attention on the other part of the relative entropy:

$$-\ln\Big( \sum_{x \in X :\, N(x) = n} e^{-\gamma |x|} \Big).$$

If we take γ = ln 2, this is precisely the algorithmic entropy [16] of the
number n. So, up to the additive constant ln Z, we have seen that algorithmic
entropy is a special case of relative entropy.

One way to think about entropy is as a measure of surprise: if you can
predict what comes next (that is, if you have a program that can compute it
for you) then you are not surprised. For example, the first 2000 bits of the
binary fraction for 1/3 can be produced with this short Python program:

    print("01" * 1000)

But if the number is complicated, if every bit is surprising and unpredictable,
then the shortest program to print the number does not do any computation at
all! It just looks something like

    print("101000011001010010100101000101111101101101001010")

Levin's coding theorem [15] says that the difference between the algorithmic
entropy of a number and its Kolmogorov complexity (the length of the
shortest program that outputs it) is bounded by a constant that only depends
on the programming language.

So, up to some error bounded by a constant, algorithmic information is
information gain. The algorithmic entropy is the information gained upon learning
a number, if our prior assumption was that this number is the output of a
randomly chosen program, randomly chosen according to the measure p where
γ = ln 2.

So, algorithmic entropy is not just analogous to entropy as defined in
statistical mechanics: it is a special case, as long as we take seriously the Bayesian
philosophy that entropy should be understood as relative entropy. This
realization opens up the possibility of taking many familiar concepts from
thermodynamics, expressed in the language of statistical mechanics, and finding their
counterparts in the realm of algorithmic information theory.
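
To make the identification of algorithmic entropy with relative entropy concrete, here is a minimal sketch on a toy example. The table `TOY_OUTPUT` is a hypothetical finite prefix-free set of halting programs with made-up outputs; a genuine universal machine has infinitely many programs, and its sums cannot be computed this way. The sketch assumes the convention S(p, q) = Σ p ln(p/q).

```python
import math
from collections import defaultdict

# Hypothetical toy "machine": halting program -> output. Illustrative only.
TOY_OUTPUT = {"0": 7, "10": 7, "110": 3, "1110": 5}

def pushforward(outputs, gamma=math.log(2)):
    """Return (q, Z), where q(i) is the probability that a program drawn from
    p(x) = exp(-gamma * |x|) / Z prints the number i."""
    weight = defaultdict(float)
    for program, out in outputs.items():
        weight[out] += math.exp(-gamma * len(program))
    Z = sum(weight.values())
    return {i: w / Z for i, w in weight.items()}, Z

def entropy_of_point_mass(n, q):
    """Relative entropy S(delta_n, q) = -ln q(n)."""
    return -math.log(q[n])

q, Z = pushforward(TOY_OUTPUT)
for n in sorted(q):
    # Subtracting ln Z leaves -ln(sum of exp(-gamma*|x|) over programs printing n).
    s_alg = entropy_of_point_mass(n, q) - math.log(Z)
    print(n, s_alg / math.log(2))   # algorithmic entropy of n, in bits
```

In this toy table the number 3 is printed only by the 3-bit program "110", so its algorithmic entropy is 3 bits, while 7 is printed by two short programs and has entropy below 1 bit.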

But to proceed, we must also understand more precisely the role of the 
measure p. In the next section, we shall see that this type of measure is already 
familiar in statistical mechanics: it is a Gibbs ensemble. 

4 Algorithmic Thermodynamics 

Suppose we have a countable set X, finite or infinite, and suppose
C_1, …, C_n : X → ℝ is some collection of functions. Then we may seek a
probability measure p that maximizes entropy subject to the constraints that the
mean value of each observable C_i is a given real number $\bar{C}_i$:

$$\sum_{x \in X} p(x)\, C_i(x) = \bar{C}_i.$$

As nicely discussed by Jaynes [10, 11], the solution, if it exists, is the so-called
Gibbs ensemble:

$$p(x) = \frac{1}{Z}\, e^{-(s_1 C_1(x) + \cdots + s_n C_n(x))}$$

for some numbers s_i ∈ ℝ depending on the desired mean values $\bar{C}_i$. Here the
normalizing factor Z is called the partition function:

$$Z = \sum_{x \in X} e^{-(s_1 C_1(x) + \cdots + s_n C_n(x))}.$$

In thermodynamics, X represents the set of microstates of some physical
system. A probability measure on X is also known as an ensemble. Each function
C_i : X → ℝ is called an observable, and the corresponding quantity s_i is
called the conjugate variable of that observable. For example, the conjugate
of the energy E is the inverse of temperature T, in units where Boltzmann's
constant equals 1. The conjugate of the volume V (of a piston full of gas, for
example) is the pressure P divided by the temperature. And in a gas containing
molecules of various types, the conjugate of the number N_i of molecules of
the i-th type is minus the 'chemical potential' μ_i, again divided by temperature.
For easy reference, we list these observables and their conjugate variables below.

THERMODYNAMICS

Observable       Conjugate Variable
energy: E        1/T
volume: V        P/T
number: N_i      −μ_i/T

Now let us return to the case where X = dom(U). Recalling that programs
are bit strings, one important observable for programs is the length:

$$|\cdot| \colon X \to \mathbb{N}.$$

We have already seen the measure

$$p(x) = \frac{1}{Z}\, e^{-\gamma |x|}.$$

Now its significance should be clear! This is the probability measure on programs
that maximizes entropy subject to the constraint that the mean length is some
constant ℓ:

$$\sum_{x \in X} p(x)\, |x| = \ell.$$

So, γ is the conjugate variable to program length.
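
The relationship between the constraint and its conjugate variable can be seen numerically: given a target mean length, one solves for the value of the conjugate variable that achieves it. The sketch below does this by bisection on a hypothetical finite prefix-free set `PROGRAMS` standing in for dom(U); on a finite set any real value of the conjugate variable gives a convergent sum, which is not true for a genuine universal machine.

```python
import math

# Hypothetical finite prefix-free set standing in for dom(U). Illustrative only.
PROGRAMS = ["0", "10", "110", "1110", "1111"]

def mean_length(gamma, programs=PROGRAMS):
    """Mean of |x| under p(x) = exp(-gamma * |x|) / Z."""
    weights = [math.exp(-gamma * len(x)) for x in programs]
    Z = sum(weights)
    return sum(w * len(x) for w, x in zip(weights, programs)) / Z

def solve_gamma(target, lo=-50.0, hi=50.0, tol=1e-12):
    """Bisection for the gamma giving the target mean length. The mean length
    is strictly decreasing in gamma when programs of different lengths exist."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_length(mid) > target:
            lo = mid    # mean still too large: gamma must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)

gamma = solve_gamma(2.0)
print(gamma, mean_length(gamma))
```

The target must lie strictly between the shortest and longest program lengths (here 1 and 4) for a solution to exist.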

There are, however, other important observables that can be defined for
programs, and each of these has a conjugate quantity. To make the analogy
to thermodynamics as vivid as possible, let us arbitrarily choose two more
observables and treat them as analogues of energy and the number of some type
of molecule. Two of the most obvious observables are 'output' and 'runtime'.
Since Levin's computable complexity measure [14] uses the logarithm of runtime
as a kind of 'cutoff' reminiscent of an energy cutoff in renormalization, we shall
arbitrarily choose the log of the runtime to be analogous to the energy, and
denote it as

$$E \colon X \to [0, \infty).$$

Following the chart above, we use 1/T to stand for the variable conjugate to E.
We arbitrarily treat the output of a program as analogous to the number of a
certain kind of molecule, and denote it as

$$N \colon X \to \mathbb{N}.$$

We use −μ/T to stand for the conjugate variable of N. Finally, as already
hinted, we denote program length as

$$V \colon X \to \mathbb{N}$$

so that in terms of our earlier notation, V(x) = |x|. We use P/T to stand for
the variable conjugate to V.

ALGORITHMS

Observable        Conjugate Variable
log runtime: E    1/T
length: V         P/T
output: N         −μ/T

Before proceeding, we wish to emphasize that the analogies here were chosen
somewhat arbitrarily. They are merely meant to illustrate the application of
thermodynamics to the study of algorithms. There may or may not be a specific
'best' mapping between observables for programs and observables for a container
of gas! Indeed, Tadaki [28] has explored another analogy, where length rather
than log runtime is treated as the analogue of energy. There is nothing wrong
with this. However, he did not introduce enough other observables to see the
whole structure of thermodynamics, as developed in Sections 4.1 and 4.2 below.

Having made our choice of observables, we define the partition function by

$$Z = \sum_{x \in X} e^{-\frac{1}{T}\left(E(x) + P V(x) - \mu N(x)\right)}.$$

When this sum converges, we can define a probability measure on X, the Gibbs
ensemble, by

$$p(x) = \frac{1}{Z}\, e^{-\frac{1}{T}\left(E(x) + P V(x) - \mu N(x)\right)}.$$

Both the partition function and the probability measure are functions of T, P
and μ. From these we can compute the mean values of the observables to which
these variables are conjugate:

$$\bar{E} = \sum_{x \in X} p(x)\, E(x)$$

$$\bar{V} = \sum_{x \in X} p(x)\, V(x)$$

$$\bar{N} = \sum_{x \in X} p(x)\, N(x)$$

In certain ranges, the map $(T, P, \mu) \mapsto (\bar{E}, \bar{V}, \bar{N})$ will be invertible. This allows
us to alternatively think of Z and p as functions of $\bar{E}$, $\bar{V}$, and $\bar{N}$. In this
situation it is typical to abuse language by omitting the overlines which denote
'mean value'.
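
The definitions above can be exercised on a toy example. This is a minimal sketch, assuming a hypothetical finite table `TOY` of halting programs with invented runtimes and outputs, and taking E to be the natural logarithm of the runtime; it is not a simulation of any real universal machine.

```python
import math

# Hypothetical halting programs: bit string -> (runtime in steps, output).
# The values are invented for illustration only.
TOY = {
    "0":    (4, 1),
    "10":   (2, 0),
    "110":  (16, 3),
    "1110": (8, 2),
}

def gibbs(T, P, mu, table=TOY):
    """Return (Z, p) for p(x) = exp(-(E(x) + P*V(x) - mu*N(x)) / T) / Z,
    with E = ln(runtime), V = length, N = output."""
    weights = {
        x: math.exp(-(math.log(rt) + P * len(x) - mu * out) / T)
        for x, (rt, out) in table.items()
    }
    Z = sum(weights.values())
    return Z, {x: w / Z for x, w in weights.items()}

def means(p, table=TOY):
    """Mean log runtime, mean length and mean output under p."""
    E = sum(px * math.log(table[x][0]) for x, px in p.items())
    V = sum(px * len(x) for x, px in p.items())
    N = sum(px * table[x][1] for x, px in p.items())
    return E, V, N

Z, p = gibbs(T=1.0, P=math.log(2), mu=0.0)
print(Z)          # 0.265625 for this table
print(means(p))
```

With T = 1, P = ln 2 and μ = 0, each program contributes 2^(−length)/runtime, so slow programs are suppressed as well as long ones.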
4.1 Elementary Relations

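The entropy S of the Gibbs ensemble is given by

$$S = -\sum_{x \in X} p(x) \ln p(x).$$

We may think of this as a function of T, P and μ, or alternatively (as explained
above) as a function of the mean values E, V, and N. Then simple calculations,
familiar from statistical mechanics [22], show that

$$\left.\frac{\partial S}{\partial E}\right|_{V,N} = \frac{1}{T} \qquad (3)$$

$$\left.\frac{\partial S}{\partial V}\right|_{E,N} = \frac{P}{T} \qquad (4)$$

$$\left.\frac{\partial S}{\partial N}\right|_{E,V} = -\frac{\mu}{T}. \qquad (5)$$

We may summarize all these by writing

$$dS = \frac{1}{T}\,dE + \frac{P}{T}\,dV - \frac{\mu}{T}\,dN$$

or equivalently

$$dE = T\,dS - P\,dV + \mu\,dN. \qquad (6)$$

Starting from the latter equation we see:

$$\left.\frac{\partial E}{\partial S}\right|_{V,N} = T \qquad (7)$$

$$\left.\frac{\partial E}{\partial V}\right|_{S,N} = -P \qquad (8)$$

$$\left.\frac{\partial E}{\partial N}\right|_{S,V} = \mu. \qquad (9)$$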
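
For readers who want the 'simple calculations' spelled out, the following is a sketch of the standard statistical-mechanics argument, written with the conjugate variables β = 1/T, γ = P/T, δ = −μ/T. It is a textbook derivation under the assumption that the sums converge and can be differentiated term by term, not a reproduction of the cited reference.

```latex
% Start from the Gibbs form of the measure:
\ln p(x) = -\beta E(x) - \gamma V(x) - \delta N(x) - \ln Z .
% Average -\ln p(x) against p to obtain the entropy:
S = -\sum_{x \in X} p(x) \ln p(x)
  = \ln Z + \beta \bar{E} + \gamma \bar{V} + \delta \bar{N} .
% Differentiating Z term by term gives
d(\ln Z) = -\bar{E}\, d\beta - \bar{V}\, d\gamma - \bar{N}\, d\delta .
% Hence the terms involving d\beta, d\gamma, d\delta cancel in dS:
dS = \beta\, d\bar{E} + \gamma\, d\bar{V} + \delta\, d\bar{N}
   = \frac{1}{T}\, d\bar{E} + \frac{P}{T}\, d\bar{V} - \frac{\mu}{T}\, d\bar{N} .
```

Reading off the coefficients of the three differentials gives the three partial derivatives of S, and multiplying through by T and rearranging gives the relation for dE.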



With these definitions, we can start to get a feel for what the conjugate
variables are measuring. To build intuition, it is useful to think of the entropy
S as roughly the logarithm of the number of programs whose log runtimes,
length and output lie in small ranges E ± ΔE, V ± ΔV and N ± ΔN. This is at
best approximately true, but in ordinary thermodynamics this approximation
is commonly employed and yields spectacularly good results. That is why in
thermodynamics people often say the entropy is the logarithm of the number
of microstates for which the observables E, V and N lie within a small range of
their specified values [22].

If you allow programs to run longer, more of them will halt and give an
answer. The algorithmic temperature, T, is roughly the number of times
you have to double the runtime in order to double the number of ways to satisfy
the constraints on length and output.

The algorithmic pressure, P, measures the tradeoff between runtime and
length [4]: if you want to keep the number of ways to satisfy the constraints
constant, then the freedom gained by having longer runtimes has to be
counterbalanced by shortening the programs. This is analogous to the pressure of gas
in a piston: if you want to keep the number of microstates of the gas constant,
then the freedom gained by increasing its energy has to be counterbalanced by
decreasing its volume.

Finally, the algorithmic potential describes the relation between log runtime
and output: it is a quantitative measure of the principle that most large
outputs must be produced by long programs.



4.2 Thermodynamic Cycles 

One of the first applications of thermodynamics was to the analysis of heat
engines. The underlying mathematics applies equally well to algorithmic
thermodynamics. Suppose C is a loop in (T, P, μ) space. Assume we are in a region
that can also be coordinatized by the variables E, V, N. Then the change in
algorithmic heat around the loop C is defined to be

$$\Delta Q = \oint_C T\,dS.$$

Suppose the loop C bounds a surface Σ. Then Stokes' theorem implies that

$$\Delta Q = \oint_C T\,dS = \int_\Sigma dT\,dS.$$

However, Equation (6) implies that

$$dT\,dS = d(T\,dS) = d(dE + P\,dV - \mu\,dN) = dP\,dV - d\mu\,dN$$

since d² = 0. So, we have

$$\Delta Q = \int_\Sigma (dP\,dV - d\mu\,dN)$$

or using Stokes' theorem again

$$\Delta Q = \oint_C (P\,dV - \mu\,dN). \qquad (10)$$

In ordinary thermodynamics, N is constant for a heat engine using gas in a
sealed piston. In this situation we have

$$\Delta Q = \oint_C P\,dV.$$

This equation says that the change in heat of the gas equals the work done on
the gas, or equivalently, minus the work done by the gas. So, in algorithmic
thermodynamics, let us define ∮_C P dV to be the algorithmic work done on our
ensemble of programs as we carry it around the loop C. Beware: this concept
is unrelated to 'computational work', meaning the amount of computation done
by a program as it runs.
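
The equality between the loop integral of T dS and the loop integral of P dV − μ dN can be checked numerically on a toy ensemble. The sketch below assumes a hypothetical four-program table `TOY` with invented runtimes and outputs, an arbitrarily chosen closed curve in (T, P, μ) space, and a simple trapezoid rule; it illustrates the identity and says nothing about real universal machines.

```python
import math

# Hypothetical toy ensemble: bit string -> (runtime in steps, output).
# The values are invented for illustration only.
TOY = {"0": (4, 1), "10": (2, 0), "110": (16, 3), "1110": (8, 2)}

def state(T, P, mu):
    """Entropy S, mean length V and mean output N of the Gibbs ensemble."""
    w = {x: math.exp(-(math.log(rt) + P * len(x) - mu * n) / T)
         for x, (rt, n) in TOY.items()}
    Z = sum(w.values())
    p = {x: wx / Z for x, wx in w.items()}
    S = -sum(px * math.log(px) for px in p.values())
    V = sum(px * len(x) for x, px in p.items())
    N = sum(px * TOY[x][1] for x, px in p.items())
    return S, V, N

def loop_integrals(steps=20000):
    """Trapezoid-rule estimates of the loop integrals of T dS and of
    P dV - mu dN around an arbitrary closed curve in (T, P, mu) space."""
    def point(k):
        a = 2 * math.pi * k / steps
        return (2.0 + 0.5 * math.cos(a),       # T
                1.0 + 0.5 * math.sin(a),       # P
                0.3 + 0.2 * math.sin(2 * a))   # mu
    pts = [point(k % steps) for k in range(steps + 1)]   # last point = first
    vals = [state(*pt) for pt in pts]
    heat = work = 0.0
    for (T0, P0, m0), (T1, P1, m1), (S0, V0, N0), (S1, V1, N1) in zip(
            pts, pts[1:], vals, vals[1:]):
        heat += 0.5 * (T0 + T1) * (S1 - S0)
        work += 0.5 * (P0 + P1) * (V1 - V0) - 0.5 * (m0 + m1) * (N1 - N0)
    return heat, work

heat, work = loop_integrals()
print(heat, work)   # the two values agree up to discretization error
```

The two integrals agree because their difference is the loop integral of dE, which vanishes around any closed curve.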

To see an example of a cycle in algorithmic thermodynamics, consider the
analogue of the heat engine patented by Stoddard in 1919 [25]. Here we fix N
to a constant value and consider the following loop in the PV plane:

[Figure: the cycle drawn as a loop in the PV plane, with the starting point (P₁, V₁) marked; horizontal axis V.]

We start with an ensemble with algorithmic pressure P₁ and mean length V₁.
We then trace out a loop built from four parts:

1. Isometric. We increase the pressure from P₁ to P₂ while keeping the mean
length constant. No algorithmic work is done on the ensemble of programs
during this step.

2. Isentropic. We increase the length from V₁ to V₂ while keeping the number
of halting programs constant. High pressure means that we're operating in
a range of runtimes where if we increase the length a little bit, many more
programs halt. In order to keep the number of halting programs constant,
we need to shorten the runtime significantly. As we gradually increase
the length and lower the runtime, the pressure drops to P₃. The total
difference in log runtime is the algorithmic work done on the ensemble
during this step.

3. Isometric. Now we decrease the pressure from P₃ to P₄ while keeping the
length constant. No algorithmic work is done during this step.

4. Isentropic. Finally, we decrease the length from V₂ back to V₁ while
keeping the number of halting programs constant. Since we're at low
pressure, we need only increase the runtime a little. As we gradually
decrease the length and increase the runtime, the pressure rises slightly
back to P₁. The total increase in log runtime is minus the algorithmic
work done on the ensemble of programs during this step.

The total algorithmic work done on the ensemble per cycle is the difference in 
log runtimes between steps 2 and 4. 

4.3 Further Relations



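From the elementary thermodynamic relations in Section 4.1 we can derive
various others. For example, the so-called 'Maxwell relations' are obtained by
computing the second derivatives of thermodynamic quantities in two different
orders and then applying the basic derivative relations, Equations (7) through (9).
While trivial to prove, these relations say some things about algorithmic
thermodynamics which may not seem intuitively obvious.

We give just one example here. Since mixed partials commute, we have:

$$\left.\frac{\partial^2 E}{\partial V\, \partial S}\right|_{N} = \left.\frac{\partial^2 E}{\partial S\, \partial V}\right|_{N}.$$

Using Equation (7), the left side can be computed as follows:

$$\left.\frac{\partial^2 E}{\partial V\, \partial S}\right|_{N} = \left.\frac{\partial}{\partial V}\right|_{S,N} \left.\frac{\partial E}{\partial S}\right|_{V,N} = \left.\frac{\partial T}{\partial V}\right|_{S,N}.$$

Similarly, we can compute the right side with the help of Equation (8):

$$\left.\frac{\partial^2 E}{\partial S\, \partial V}\right|_{N} = \left.\frac{\partial}{\partial S}\right|_{V,N} \left.\frac{\partial E}{\partial V}\right|_{S,N} = -\left.\frac{\partial P}{\partial S}\right|_{V,N}.$$

As a result, we obtain:

$$\left.\frac{\partial T}{\partial V}\right|_{S,N} = -\left.\frac{\partial P}{\partial S}\right|_{V,N}.$$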

We can also derive interesting relations involving derivatives of the partition
function. These become more manageable if we rewrite the partition function
in terms of the conjugate variables of the observables E, V, and N:

$$\beta = \frac{1}{T}, \qquad \gamma = \frac{P}{T}, \qquad \delta = -\frac{\mu}{T}. \qquad (11)$$

Then we have

$$Z = \sum_{x \in X} e^{-\beta E(x) - \gamma V(x) - \delta N(x)}.$$

Simple calculations, standard in statistical mechanics [22], then allow us to
compute the mean values of observables as derivatives of the logarithm of Z
with respect to their conjugate variables. Here let us revert to using overlines
to denote mean values:

$$\bar{E} = -\frac{\partial}{\partial \beta} \ln Z = \sum_{x \in X} p(x)\, E(x)$$

$$\bar{V} = -\frac{\partial}{\partial \gamma} \ln Z = \sum_{x \in X} p(x)\, V(x)$$

$$\bar{N} = -\frac{\partial}{\partial \delta} \ln Z = \sum_{x \in X} p(x)\, N(x)$$

We can go further and compute the variance of these observables using second
derivatives:

$$(\Delta E)^2 = \sum_{x \in X} p(x)\left(E(x)^2 - \bar{E}^2\right) = \frac{\partial^2}{\partial \beta^2} \ln Z$$

and similarly for V and N. Higher moments of E, V and N can be computed
by taking higher derivatives of ln Z.
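
These derivative formulas are easy to test numerically. The sketch below compares the mean and variance of E computed directly from the Gibbs measure with central finite differences of ln Z in β. The table `TOY`, the parameter values, and the step size `h` are all hypothetical choices made for illustration.

```python
import math

# Hypothetical toy ensemble: bit string -> (runtime in steps, output).
TOY = {"0": (4, 1), "10": (2, 0), "110": (16, 3), "1110": (8, 2)}

def log_Z(beta, gamma, delta):
    """ln Z with E = ln(runtime), V = length, N = output."""
    return math.log(sum(
        math.exp(-beta * math.log(rt) - gamma * len(x) - delta * n)
        for x, (rt, n) in TOY.items()))

def energy_moments(beta, gamma, delta):
    """Mean and variance of E computed directly from the Gibbs measure."""
    w = {x: math.exp(-beta * math.log(rt) - gamma * len(x) - delta * n)
         for x, (rt, n) in TOY.items()}
    Z = sum(w.values())
    mean = sum(wx / Z * math.log(TOY[x][0]) for x, wx in w.items())
    second = sum(wx / Z * math.log(TOY[x][0]) ** 2 for x, wx in w.items())
    return mean, second - mean ** 2

b, g, d, h = 1.0, math.log(2), 0.5, 1e-4
mean, var = energy_moments(b, g, d)
# Central finite differences in beta approximate the derivatives of ln Z.
d1 = (log_Z(b + h, g, d) - log_Z(b - h, g, d)) / (2 * h)
d2 = (log_Z(b + h, g, d) - 2 * log_Z(b, g, d) + log_Z(b - h, g, d)) / h ** 2
print(mean, -d1)   # mean of E versus minus the first derivative of ln Z
print(var, d2)     # variance of E versus the second derivative of ln Z
```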

4.4 Convergence 

So far we have postponed the crucial question of convergence: for which values of
T, P and μ does the partition function Z converge? For this it is most convenient
to treat Z as a function of the variables β, γ and δ introduced in Equation (11).
For which values of β, γ and δ does the partition function converge?

First, when β = γ = δ = 0, the contribution of each program is 1. Since
there are infinitely many halting programs, Z(0, 0, 0) does not converge.

Second, when β = 0, γ = ln 2, and δ = 0, the partition function converges to
Chaitin's number

$$\Omega = \sum_{x \in X} 2^{-V(x)}.$$

To see that the partition function converges in this case, consider this mapping
of strings to segments of the unit interval:

                     empty
          0                       1
     00        01           10        11
  000  001  010  011     100  101  110  111

Each segment consists of all the real numbers whose binary expansion begins
with that string; for example, the set of real numbers whose binary expansion
begins 0.101... is [0.101, 0.110) and has measure 2^(−|101|) = 2^(−3) = 1/8. Since the
set of halting programs for our universal machine is prefix-free, we never count
any segment more than once, so the sum of all the segments corresponding to
halting programs is at most 1.
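
The interval argument can be made concrete with exact rational arithmetic. In this sketch the helper names `interval` and `kraft_sum` and the sample set `programs` are hypothetical; the set is a small prefix-free example, not the domain of any actual universal machine.

```python
from fractions import Fraction

def interval(bits):
    """Half-open interval [lo, hi) of reals in [0, 1) whose binary expansion
    begins with the given bit string."""
    lo = Fraction(int(bits, 2), 2 ** len(bits)) if bits else Fraction(0)
    return lo, lo + Fraction(1, 2 ** len(bits))

def kraft_sum(programs):
    """Sum of 2^(-|x|): the total length of the intervals of the programs."""
    return sum(Fraction(1, 2 ** len(x)) for x in programs)

# Hypothetical prefix-free set of halting programs (illustrative only).
programs = ["00", "010", "101", "11"]
print(interval("101"))       # (Fraction(5, 8), Fraction(3, 4))
print(kraft_sum(programs))   # 3/4
```

Because the set is prefix-free, the four intervals are disjoint subsets of [0, 1), which is why the sum cannot exceed 1.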

Third, Tadaki has shown [27] that the expression

$$\sum_{x \in X} e^{-\gamma V(x)}$$

converges for γ ≥ ln 2 but diverges for γ < ln 2. It follows that Z(β, γ, δ)
converges whenever γ ≥ ln 2 and β, δ ≥ 0.

Fourth, when β > 0 and γ = δ = 0, convergence depends on the machine.
There are machines where infinitely many programs halt immediately. For these,
Z(β, 0, 0) does not converge. However, there are also machines where program
x takes at least V(x) steps to halt; for these machines Z(β, 0, 0) will converge
when β > ln 2. Other machines take much longer to run. For these, Z(β, 0, 0)
will converge for even smaller values of β.

Fifth and finally, when β = γ = 0 and δ ≥ 0, Z(β, γ, δ) fails to converge,
since there are infinitely many programs that halt and output 0.

4.5 Computability 

Even when the partition function Z converges, it may not be computable. The
theory of computable real numbers was independently introduced by Church,
Post, and Turing, and later blossomed into the field of computable analysis [21].
We will only need the basic definition: a real number a is computable if there
is a recursive function that maps any natural number n > 0 to an integer f(n)
such that

$$\frac{f(n)}{n} \le a \le \frac{f(n) + 1}{n}.$$

In other words, for any n > 0, we can compute a rational number that
approximates a with an error of at most 1/n. This definition can be formulated in
various other equivalent ways: for example, the computability of binary digits.
Chaitin [6] proved that the number

$$\Omega = Z(0, \ln 2, 0)$$

is uncomputable. In fact, he showed that for any universal machine, the values
of all but finitely many bits of Ω are not only uncomputable, but random:
knowing the value of some of them tells you nothing about the rest. They're
independent, like separate flips of a fair coin.
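
For contrast with Ω, here is what the definition of a computable real looks like for a number that is computable. The sketch uses the square root of 2 as the example (an illustration chosen here, not one discussed in the text) and exhibits a function f with f(n)/n ≤ √2 ≤ (f(n) + 1)/n using only integer arithmetic.

```python
from fractions import Fraction
from math import isqrt

def f(n):
    """Integer f(n) with f(n)/n <= sqrt(2) <= (f(n) + 1)/n."""
    return isqrt(2 * n * n)   # floor(n * sqrt(2)), computed exactly

for n in (1, 10, 1000, 10**12):
    lo, hi = Fraction(f(n), n), Fraction(f(n) + 1, n)
    # Compare squares so that no floating-point rounding is involved.
    assert lo * lo <= 2 <= hi * hi
    print(n, f(n))
```

Chaitin's theorem says that no program of this kind can exist for Ω.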

More generally, for any computable number γ > ln 2, Z(0, γ, 0) is 'partially
random' in the sense of Tadaki [5, 27]. This deserves a word of explanation. A
fixed formal system with finitely many axioms can only prove finitely many bits
of Z(0, γ, 0) have the values they do; after that, one has to add more axioms or
rules to the system to make any progress. The number Ω is completely random
in the following sense: for each bit of axiom or rule one adds, one can prove at
most one more bit of its binary expansion has the value it does. So, the most
efficient way to prove the values of these bits is simply to add them as axioms!
But for Z(0, γ, 0) with γ > ln 2, the ratio of bits of axiom per bits of sequence
is less than 1. In fact, Tadaki showed that for any computable γ > ln 2,
the ratio can be reduced to exactly (ln 2)/γ.

On the other hand, Z(β, γ, δ) is computable for all computable real numbers
β > 0, γ ≥ ln 2 and δ ≥ 0. The reason is that β > 0 exponentially suppresses
the contribution of programs with long runtimes, eliminating the problem posed
by the undecidability of the halting problem. The fundamental insight here is
due to Levin [14]. His idea was to 'dovetail' all programs: on turn n, run each
of the first n programs a single step and look to see which ones have halted. As
they halt, add their contribution to the running estimate of Z. For any k ≥ 0
and turn t ≥ 0, let k_t be the location of the first zero bit after position k in the
estimation of Z. Then because −βE(x) is a monotonically decreasing function
of the runtime and decreases faster than k_t, there will be a time step where the
total contribution of all the programs that have not halted yet is less than 2^(−k_t).
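
The dovetailing schedule can be sketched in a few lines. Everything concrete here is hypothetical: programs are modelled as Python generators that yield once per step, `PROGRAMS` is a four-entry stand-in for an enumeration of all bit strings, and E is taken to be the natural logarithm of the number of steps. A real implementation would simulate the universal machine U on every string and would also need the error bound described above to know when to stop.

```python
import math

def halts_after(steps, output):
    """Hypothetical toy program: runs for `steps` steps, then returns `output`."""
    def run():
        for _ in range(steps):
            yield
        return output
    return run

def loops_forever():
    """Hypothetical toy program that never halts."""
    while True:
        yield

# Toy stand-in for an enumeration of bit strings and their behaviour.
PROGRAMS = [("0", halts_after(3, 1)), ("10", loops_forever),
            ("110", halts_after(50, 2)), ("111", halts_after(7, 0))]

def dovetail(beta, gamma, delta, turns):
    """Lower estimate of Z after `turns` turns: on turn n, advance each of the
    first n programs by one step and add the weight of any that halt."""
    running, steps_run, Z = {}, {}, 0.0
    for n in range(1, turns + 1):
        for i, (bits, make) in enumerate(PROGRAMS[:n]):
            if i not in running:
                if i in steps_run:          # this program has already halted
                    continue
                running[i], steps_run[i] = make(), 0
            try:
                next(running[i])
                steps_run[i] += 1
            except StopIteration as halt:
                E = math.log(max(steps_run[i], 1))   # log runtime
                Z += math.exp(-beta * E - gamma * len(bits) - delta * halt.value)
                del running[i]
    return Z

print(dovetail(1.0, math.log(2), 0.0, turns=20))
print(dovetail(1.0, math.log(2), 0.0, turns=100))
```

The estimate never decreases as more turns are run, and the non-halting program never blocks progress on the others, which is the point of dovetailing.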

5 Conclusions 

There are many further directions to explore. Here we mention just three. First, 
as already mentioned, the 'Kolmogorov complexity' [T^] of a number n is the 
number of bits in the shortest program that produces n as output. However, 
a very short program that runs for a million years before giving an answer is 
not very practical. To address this problem, the Levin complexity [14] of n
is defined using the program's length plus the logarithm of its runtime, again
minimized over all programs that produce n as output. Unlike the Kolmogorov
complexity, the Levin complexity is computable. But like the Kolmogorov
complexity, the Levin complexity can be seen as a relative entropy, at least up to
some error bounded by a constant. The only difference is that now we
compute this entropy relative to a different probability measure: instead of using
the Gibbs distribution at infinite algorithmic temperature, we drop the
temperature to ln 2. Indeed, the Kolmogorov and Levin complexities are just two
examples from a continuum of options. By adjusting the algorithmic pressure 
and temperature, we get complexities involving other linear combinations of 
length and log runtime. The same formalism works for complexities involving 
other observables: for example, the maximum amount of memory the program 
uses while running. 
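
The difference between the two complexities shows up already on a tiny example. The sketch below assumes a hypothetical finite table `TOY` of halting programs with invented runtimes, and measures the runtime term with a base-2 logarithm so that both terms are in bits; on a real machine the Kolmogorov minimum ranges over all programs and is not computable.

```python
import math

# Hypothetical table of halting programs: bit string -> (runtime in steps, output).
TOY = {
    "0":      (1024, 5),   # short but slow
    "110101": (2, 5),      # longer but fast
    "10":     (8, 3),
}

def kolmogorov(n, table=TOY):
    """Length of the shortest program in the table that outputs n."""
    return min(len(x) for x, (_, out) in table.items() if out == n)

def levin(n, table=TOY):
    """Minimum of length + log2(runtime) over programs in the table that output n."""
    return min(len(x) + math.log2(rt)
               for x, (rt, out) in table.items() if out == n)

print(kolmogorov(5), levin(5))   # 1 7.0
```

For the output 5, the shortest program has length 1 but runs for 1024 steps, so the minimum of length plus log runtime is attained by the longer, faster program instead.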

Second, instead of considering Turing machines that output a single natural
number, we can consider machines that output a finite list of natural numbers
(N₁, …, N_j); we can treat these as populations of different "chemical species"
and define algorithmic potentials for each of them. Processes analogous to
chemical reactions are paths through this space that preserve certain invariants of the
lists. With chemical reactions we can consider things like internal combustion
cycles.

Finally, in ordinary thermodynamics the partition function Z is simply a 
number after we fix values of the conjugate variables. The same is true in 
algorithmic thermodynamics. However, in algorithmic thermodynamics, it is 
natural to express this number in binary and inquire about the algorithmic 
entropy of the first n bits. For example, we have seen that for suitable values
of temperature, pressure and chemical potential, Z is Chaitin's number Ω. For
each universal machine there exists a constant c such that the first n bits of the
number Ω have at least n − c bits of algorithmic entropy with respect to that
machine. Tadaki [27] generalized this computation to other cases.

So, in algorithmic thermodynamics, the partition function itself has nontrivial
entropy. Tadaki has shown that the same is true for algorithmic pressure
(which in his analogy he calls 'temperature'). This reflects the self-referential
nature of computation. It would be worthwhile to understand this more deeply.